Measuring Speech Activity
Abstract
This report discusses the algorithm described in ITU-T Recommendation P.56 for measuring the active speech level. Method B in P.56 determines a speech activity factor, representing the fraction of time that the signal is considered to be active speech (as opposed to background idle noise), and the corresponding active level for the speech part of the signal. The basic algorithm generates an envelope value at each sample time. The envelope values are compared with a discrete set of thresholds. The (approximate) active speech level is determined by interpolating in the log domain between the threshold values. In this report we assess the effect of this interpolation on the active speech level. Recommendation P.56 allows for sampling rates as low as 600 Hz. Results for subsampled data are compared with those calculated at the full speech sampling rate.

Speech activity measurement involves determining the fraction of time that a signal contains active speech and the speech level while speech is active. Knowledge of the speech activity is important in speech signal measurements. For speech databases, it is important to ensure that undue leading and trailing non-speech is excised and that the speech level is properly scaled based on the peak signal level and the active speech level [1]. For testing speech coders with environmental noise, artificial test signals are created by adding recorded background noise to clean speech segments. The signal-to-noise ratio for such speech-plus-noise signals is determined as the ratio of the active level for the speech to the rms level for the recorded noise [1].

In the speech coding community, considerable research effort is being expended on variable-rate coders and discontinuous transmission systems that attempt to economize on average bit rate and/or power consumption by exploiting the fact that speech occurs in talk spurts.
The efficacy of such techniques can be compared to speech activity measurements.

Specifications for the measurement of the level of speech signals are given in ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) Recommendation P.56 [2] as Method B. The measurement of the active level of speech takes into account the fact that speech may contain embedded pauses. Experiments have shown that listeners will perceive a pause in the speech if there is a gap of 350–400 ms or larger [3]. If such gaps are due to pauses between phrases or pauses to emphasize words, they are termed grammatical pauses. Grammatical pauses and other long gaps with idle noise do not affect the perceived loudness and are not counted as active speech. The smaller gaps inherent in any utterance are termed structural pauses and are counted as part of the active speech segment.

The output of the speech activity algorithm is a speech activity factor, representing the fraction of the signal that can be considered to be active speech, and the corresponding active speech level for the speech part of the signal. An implementation of a Speech Voltmeter using the algorithm in Recommendation P.56 is part of the ITU-T Software Tools Library [4][5], referred to here as ITU-T STL.

The algorithm under discussion provides active level information for an utterance as a whole. Other speech level measurements give an immediate indication of the speech level and are meant for real-time use (see the discussion of Method A in [2]). An example is the volume unit (VU) meter often seen on both professional and consumer audio equipment.

1 Envelope Calculation

The speech activity algorithm calculates an "envelope" for the speech signal. This is a double exponential filtering of the magnitude of the speech sample values,

$$p_i = g\,p_{i-1} + (1-g)\,|x_i|, \qquad q_i = g\,q_{i-1} + (1-g)\,|p_i|. \tag{1}$$

The envelope $q_i$ is calculated starting with zero initial conditions. The parameter $g$ is determined by the time constant of the averaging and is set to
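The envelope recursion in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the ITU-T STL implementation: the function name `envelope` is hypothetical, and the smoothing parameter `g` is left as a caller-supplied argument because its value is not given in the text above.

```python
def envelope(samples, g):
    """Double exponential smoothing of |x_i| as in Eq. (1).

    samples: iterable of speech sample values x_i
    g:       smoothing parameter, assumed here to satisfy 0 <= g < 1
             (its actual value is not specified in this excerpt)
    Returns the list of envelope values q_i, one per input sample.
    """
    if not 0.0 <= g < 1.0:
        raise ValueError("g is assumed to lie in [0, 1)")

    p = 0.0  # first-stage state p_{i-1}, zero initial condition
    q = 0.0  # second-stage state q_{i-1}, zero initial condition
    out = []
    for x in samples:
        p = g * p + (1.0 - g) * abs(x)  # first smoothing stage
        q = g * q + (1.0 - g) * abs(p)  # second smoothing stage
        out.append(q)
    return out


if __name__ == "__main__":
    # g = 0.5 is an illustrative value only, chosen so the arithmetic is easy to follow.
    print(envelope([1.0, -1.0, 1.0, -1.0], 0.5))
```

Because the magnitude of each sample is taken before smoothing, the envelope depends only on |x_i|. For a constant-magnitude input it rises from zero toward that magnitude.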
Similar resources
A New Algorithm for Voice Activity Detection Based on Wavelet Packets (RESEARCH NOTE)
Speech constitutes much of the communicated information; most other perceived audio signals do not carry nearly as much information. Indeed, many non-speech signals may be classified as 'noise' in human communication. The process of separating conversational speech and noise is termed voice activity detection (VAD). This paper describes a new approach to VAD which is based on the Wavelet ...
The effect of redesign workstation on Speech Interference Level (SIL) among bank tellers
Abstract Background: There is always an interaction between man and his environment that can be the cause of physical, physiological and psychological stress on people and also cause discomfort, annoyance, and have direct and indirect effects on their performance and productivity, health and safety. People in their workplace are exposed to many factors related to work activities and environmen...
Measuring the perceived importance of time- and frequency-divided speech blocks for transmitting over packet networks
This paper presents a way to calculate the perceived importance of speech segments as a single value criterion, using a linear regression model. Unlike the commonly used voice activity detection (VAD) algorithms, this method allows us to obtain a finer priority granularity of speech segments. This can be used in conjunction with frequency scalable speech coding techniques and IP QoS techniques ...
The Effect of Region of 'Activity' Measures on Automatic Audio-Visual Speech Recognition
Automatic Speech Recognition (ASR) has made tremendous progress in the last few decades. Even so, audio-only speech recognition (A-ASR) does not work well in noisy environments. The standard approach to dealing with this shortcoming is to use visual information along with the audio. Many approaches to using the visual modality have been devised. In this paper, I propose a method that will try t...
Effective visually-derived Wiener filtering for audio-visual speech processing
This work presents a novel approach to speech enhancement by exploiting the bimodality of speech and the correlation that exists between audio and visual speech features. For speech enhancement, a visually-derived Wiener filter is developed. This obtains clean speech statistics from visual features by modelling their joint density and making a maximum a posteriori estimate of clean audio from v...